Welcome back to deep learning. Today we want to discuss the basics of reinforcement learning. We will look into how we can teach a system to play different games, and we will start with a first introduction to sequential decision making. So I have a couple of slides for you; you can see that the topic is reinforcement learning, and we want to go ahead and talk about sequential decision making. Later in this course we will also cover reinforcement learning in full detail, and we will look into deep reinforcement learning, but today we will only look at sequential decision making.
Okay, sequential decision making. Well, we want to play a couple of games, and the simplest game you can think of is one where you just pull a couple of levers. If you try to formalize this, you end up with the so-called multi-armed bandit problem.
So let's do a couple of definitions. We need some actions, and we formalize this as choosing an action A_t at time t from a set of actions, capital A. So this is a discrete set of possible actions that we can take, and choosing an action has consequences: if you choose the action A_t, then you will generate some reward R_t. But the relation between the action and the reward is probabilistic, which means that there is an unknown, typically different probability density function that describes the actual relation between each action and its reward. So if you think of your multi-armed bandit, you have a couple of slot machines, you pull one of the levers, and this generates some reward. Maybe all of these slot machines are the same, but probably they are not, so each arm that you could potentially pull has a different probability of generating some reward R_t.
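To make this concrete, here is a minimal sketch of such a bandit in Python; the class name, the number of arms, and the Gaussian reward model are illustrative assumptions, not something fixed by the lecture:

import numpy as np

class GaussianBandit:
    # A k-armed bandit: each arm pays a noisy reward around its own hidden mean.
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.means = self.rng.normal(0.0, 1.0, size=k)  # hidden from the player

    def pull(self, a):
        # The reward is stochastic: the same action can yield different rewards.
        return self.rng.normal(self.means[a], 1.0)

bandit = GaussianBandit()
r = bandit.pull(3)   # pull lever 3 and observe a reward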
Now, you want to be able to pick an action, and in order to do so we define a so-called policy. The policy is a way of formalizing how to choose an action; it is essentially also a probability density function that describes the likelihood of choosing some action, and it is the way in which we want to influence the game. So the policy is something that lies in our hands: we can define this policy, and of course we want to make it optimal with respect to playing the game.
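As a small, hypothetical illustration of a policy as a probability distribution over actions, consider sampling an action from a probability vector; the uniform choice below is only a placeholder, a good policy would shift its probabilities toward rewarding arms:

import numpy as np

def sample_action(policy, rng):
    # policy[a] is the probability of choosing action a; the entries sum to one.
    return rng.choice(len(policy), p=policy)

rng = np.random.default_rng(0)
policy = np.ones(10) / 10.0      # a uniform policy: every arm equally likely
a = sample_action(policy, rng)   # draw one action according to the policy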
So what's the key element? What do we want to achieve? We want to achieve maximum reward, and in particular we do not just want the maximum reward in every single time step of the game; instead, we want to maximize the expected reward over time. So we produce an estimate of the reward that is going to be obtained and compute a kind of mean value over it, because this allows us to estimate which actions produce what kind of rewards if we play the game for a long time.
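In the usual bandit notation, which I am assuming here (it is not spelled out in this part of the lecture), this objective can be written as choosing the action with the largest expected reward:

\[ q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right], \qquad a^{\ast} = \operatorname*{arg\,max}_{a \in \mathcal{A}} q(a) \]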
So this is a difference to supervised learning: here we are not told to do this action or do that action; instead, our training algorithm has to determine which actions to choose, and obviously we can make mistakes. The aim is then to choose the actions that will, over time, produce the maximum expected reward. So it is not so important if we lose in a single step, as long as on average we can still generate a high reward. The problem here, of course, is that the expected value of our reward is not known in advance. So this is the actual problem in reinforcement learning: we want to estimate this expected reward and the associated probabilities.
So what we can do is form R as a one-hot encoded vector which reflects which action from A actually caused the reward. If we do so, we can estimate the probability density function online by averaging, and we introduce this as the function Q(a), the so-called action-value function, which essentially changes with every new piece of information that we observe.
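A minimal sketch of this online averaging, assuming we simply keep a per-action reward sum and a per-action count (the variable names are mine):

import numpy as np

k = 10
reward_sum = np.zeros(k)   # accumulated reward per action
counts = np.zeros(k)       # how often each action has been chosen

def update(a, r):
    # Incorporate one observed (action, reward) pair.
    reward_sum[a] += r
    counts[a] += 1

def Q(a):
    # Current action-value estimate: the average reward observed for action a.
    return reward_sum[a] / counts[a] if counts[a] > 0 else 0.0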
So how can we do this? Well, there is an incremental way of computing Q_t(a), and we can show it quite easily. We define Q_{t+1}(a) as the sum of the rewards obtained over all time steps up to t, divided by t.
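Written as a formula (assuming, as the indexing here suggests, that R_1 through R_t are the rewards obtained for this action a):

\[ Q_{t+1}(a) = \frac{1}{t} \sum_{i=1}^{t} R_i \]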
Now we can show that this can be split up: we take out the last element of the sum, which is R_t, and let the remaining sum run only from 1 to t minus 1. If we do so, we can also introduce a factor of t minus 1 and at the same time divide by t minus 1; this cancels out to one, so the remaining part is just the average of the first t minus 1 rewards, which is our previous estimate Q_t(a).
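Carrying this splitting step through gives the standard incremental form; this is the result the derivation is heading towards, written in the same notation:

\[ Q_{t+1}(a) = \frac{1}{t}\left( R_t + \sum_{i=1}^{t-1} R_i \right) = \frac{1}{t}\left( R_t + (t-1) \cdot \frac{1}{t-1}\sum_{i=1}^{t-1} R_i \right) = \frac{1}{t}\left( R_t + (t-1)\, Q_t(a) \right) = Q_t(a) + \frac{1}{t}\left( R_t - Q_t(a) \right) \]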
Deep Learning - Reinforcement Learning Part 1
This video explains the concepts of sequential decision making and the multi-armed bandit problem.